Multi-level Disambiguation Grammar Inferred from English Corpus, Treebank, and Dictionary
Authors
Abstract
In this paper we show that Grammatical Inference is applicable to Natural Language Processing. Given the wide and complex range of structures found in an unrestricted natural language such as English, full Grammatical Inference, yielding a comprehensive syntactic and semantic definition of English, is too much to hope for at present. Instead, we focus on techniques for resolving ambiguity by probabilistic ranking, which does not require a full formal Chomskyan grammar. We give a short overview of the different levels and methods being investigated at CCALAS for probabilistic ranking of candidates in ambiguous English input.
Similar resources
Disambiguating Compound Nouns for a Dynamic HPSG Treebank of Wall Street Journal Texts
The aim of this paper is twofold. On the one hand, we focus on the task of dynamically annotating English compound nouns; on the other, we propose disambiguation methods and techniques that facilitate the annotation task. Both are part of a larger ongoing effort which aims to create HPSG annotation for the texts from the Wall Street Journal (henceforward WSJ) secti...
Sejong Korean Corpora in the Making
The 21st Century Sejong Project is a comprehensive project aiming to build various kinds of language resources, including Korean corpora comparable to the BNC (Aston & Burnard, 1998) and Korean electronic dictionaries. The project was conceived in 1997 and started in 1998 as a 10-year project. By 2003, we had completed 6 years of our work. The Sejong Corpora are a collection of raw corpor...
Towards an LFG parser for Polish: An exercise in parasitic grammar development
While it is possible to build a formal grammar manually from scratch or, going to another extreme, to derive it automatically from a treebank, the development of the LFG grammar of Polish presented in this paper is different from both of these methods as it relies on extensive reuse of existing language resources for Polish. LFG grammars minimally provide two levels of representation: constitue...
A new semantically annotated corpus with syntactic-semantic and cross-lingual senses
In this article, we describe a new sense-tagged corpus for Word Sense Disambiguation. The corpus consists of instances of 20 French polysemous verbs. Each verb instance is annotated with three sense labels: (1) the actual translation of the verb in the English version of this instance in a parallel corpus, (2) an entry of the verb in a computational dictionary of French (the Lexicon-Gramm...
The Interplay Between Lexical and Syntactic Resources in Incremental Parsebanking
Automatic syntactic analysis of a corpus requires detailed lexical and morphological information that cannot always be harvested from traditional dictionaries. In building the INESS Norwegian treebank, it is often the case that necessary lexical information is missing in the morphology or lexicon. The approach used to build the treebank is incremental parsebanking; a corpus is parsed with an ex...